Extracting information from PHE blogs and publications

Julian Flowers

2018-04-12

PHE publishes a range of information to https://www.gov.uk/phe inlcuding blogs, statistics and other publications.

This vignette introduces an internal PHE R package myScrapers which includes tools to extract published public health information from:

The functions offered include

These tools are intended to help text mining and analysis of published content, to assist developing outputs such as products and services catalogues, and to make it easier to identify public health content at NHS Digital.

The functions are assembled as an R package which is currently available on github at https://github.com/julianflowers/myScrapers.

Getting started

The first step is to install the package. This can be achieved with the code below:

library(devtools)
if(!require("myScrapers"))install_github("julianflowers/myScrapers", build_vignettes =TRUE)
library(myScrapers)

Public health content on NHS Digital

Identified PH content is available from https://digital.nhs.uk/article/4375/Public-health which give the following page which is a search spread over 3 pages.

We can set up the URLs to be scraped:


url <- "https://digital.nhs.uk/article/4375/Public-health"
url1 <- paste0(url ,"?p=", 2:3)
links <- c(url, url1)
links
#> [1] "https://digital.nhs.uk/article/4375/Public-health"    
#> [2] "https://digital.nhs.uk/article/4375/Public-health?p=2"
#> [3] "https://digital.nhs.uk/article/4375/Public-health?p=3"

Then use the get_nhsd_pubs function to identify a list of relevant publications. We use the purrr::map function to apply the extraction function to each page.

ph_pubs <- map(links, function(x) get_nhsd_pubs(x))

test <- flatten(ph_pubs) %>% na.omit()

test[1:3]
#> [[1]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30258"
#> 
#> [[2]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30256"
#> 
#> [[3]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30241"

Next, we can identify the links for spreadsheets from these pages, and also extract metadata and titles and create a tabular output.

# Identify spreadsheets, headers and descriptions

xls <- map(test, function(x) get_nhsd_sheets(x))
xls1 <- map(xls, data.frame)
xls2 <- map_df(xls1, bind_rows)
colnames(xls2) <- c("xls", "url")

head <- map(test, function(x) get_nhsd_header(x)) 
head1 <- map(head, data.frame)
head2 <- map_df(head1, bind_rows)
colnames(head2) <- c("title", "url")

meta <- map(test, function(x) get_nhsd_metadata(x))
meta1 <- map(meta, data.frame)
meta2 <- map_df(meta1, bind_rows)
colnames(meta2) <- c("description", "url")

left_join(head2, meta2) %>%
  left_join(xls2) %>%
  mutate(xls = paste0("<a href =",  xls,  ">xls</a>")) %>%
  mutate(description = str_replace_all(description, "Summary", "")) %>%
  DT::datatable(rownames = FALSE, escape = FALSE, filter = "top", extensions = 'Buttons', options = list(
     
    columnDefs = list(list(autowidth = TRUE, width = '10%', targets = list(1, 2))),
    scrollX = TRUE, 
    pageLength = 25, 
    dom = 'Bfrtip',
    buttons = c('csv', 'excel', 'pdf'))) 
#> Joining, by = "url"
#> Joining, by = "url"

PHE Blogs

The URL for PHE blogs is “https://publichealthmatters.blog.gov.uk”.

We can obtain a table of blog topic categories or authors…


url <- "https://publichealthmatters.blog.gov.uk"
url1 <- paste0(url,  "/page/", 2:73)
urls <- c(url, url1)

cats <- get_blog_categories(urls)
 
cats
#>  [1] "duncan-selbie-friday-message"  "hwb"                          
#>  [3] "hp"                            "phes-science"                 
#>  [5] "priority2"                     "local-authority-public-health"
#>  [7] "health-improvement"            "health-matters"               
#>  [9] "health-and-work"               "disease-detectives"           
#> [11] "data-blog"                     "digital"                      
#> [13] "uncategorized"                 "cko"                          
#> [15] "science-hub"                   "london-region"                
#> [17] "priority3"                     "global-health"                
#> [19] "northern-region"               "health-profile-for-england"   
#> [21] "priority5"                     "phe-announcement"             
#> [23] "phe-people"                    "health-in-a-changing-climate" 
#> [25] "health-visitors"               "the-week-at-phe"              
#> [27] "priority6"                     "nursing"                      
#> [29] "priority4"                     "midlands-and-east-of-england" 
#> [31] "midwifery"                     "priority1"                    
#> [33] "priority7"                     "microbiology-services"        
#> [35] "mythbuster"                    "ab"

url <- "https://publichealthmatters.blog.gov.uk"

authors <- get_blog_authors(url)
 
authors
#>   [1] "duncan-selbie"                 "kate-folkard"                 
#>   [3] "dominik-zenner"                "binta-hanakuka"               
#>   [5] "brian-ferguson"                "peter-kelly"                  
#>   [7] "local-authority-public-health" "martin-dockrell"              
#>   [9] "ann-marie-connolly"            "jane-south"                   
#>  [11] "carla-brown"                   "catherine-hayes"              
#>  [13] "eamonn-omoore"                 "jo-peden"                     
#>  [15] "justin-varney"                 "alison-tedstone"              
#>  [17] "martin-ward-platt"             "jude-stansfield"              
#>  [19] "mary-bythell"                  "jamie-waterall"               
#>  [21] "jennifer-lloyd"                "leoni-belsman"                
#>  [23] "rosanna-oconnor"               "dave-jephson"                 
#>  [25] "justine-fitzpatrick"           "clare-griffiths"              
#>  [27] "eustace-de-sousa"              "claire-robson"                
#>  [29] "emma-seria-walker"             "linda-hindle"                 
#>  [31] "helen-christmas"               "charlotte-fellows"            
#>  [33] "james-westwood"                "john-newton"                  
#>  [35] "blog-editor"                   "diarmaid-crean"               
#>  [37] "jane-leaman"                   "vanessa-saliba"               
#>  [39] "lucy-elliss-brookes"           "ravi-jaipaul"                 
#>  [41] "paul-cosford"                  "lucy-young"                   
#>  [43] "liz-price"                     "jem-rashbass"                 
#>  [45] "ruth-swann"                    "richard-gleave"               
#>  [47] "yvonne-doyle"                  "valerie-delpech"              
#>  [49] "louis-levy"                    "sema-mandal"                  
#>  [51] "anthony-nardone"               "diane-ashiru-oredope"         
#>  [53] "clare-heaviside"               "dame-carol-black"             
#>  [55] "shaun-donaghy"                 "jim-mcmanus"                  
#>  [57] "sandra-white"                  "paul-johnstone"               
#>  [59] "helen-garnham"                 "gregor-henderson"             
#>  [61] "russ-moody"                    "david-green"                  
#>  [63] "10"                            "sacha-wyke"                   
#>  [65] "allan-baker"                   "sonia-gill"                   
#>  [67] "marilena-korkodilos"           "jenny-harries"                
#>  [69] "matthew-day"                   "nino-manddalena"              
#>  [71] "sarah-smith"                   "tim-gant"                     
#>  [73] "tony-vickers-byrne"            "anne-mackie"                  
#>  [75] "christine-mccartney"           "simon-thomas"                 
#>  [77] "chloe-johnson"                 "claire-laurent"               
#>  [79] "08"                            "ruth-stubbs"                  
#>  [81] "mohammed-vaqar"                "helen-harris"                 
#>  [83] "alison-keating"                "faith-ege"                    
#>  [85] "mel-sirotkin"                  "jamie-blackshaw"              
#>  [87] "sophie-newbound"               "michael-brodie"               
#>  [89] "joanne-yarwood"                "john-ohagan"                  
#>  [91] "elizabeth-castle"              "tim-chadborn"                 
#>  [93] "kevin-fenton"                  "leonie-prasad"                
#>  [95] "panos-zerdevas"                "sarah-anderson"               
#>  [97] "clare-perkins"                 "derren-ready"                 
#>  [99] "david-pencheon"                "viv-bennett"                  
#> [101] "craig-wright"                  "maxine-johnson"               
#> [103] "polly-kwok"                    "sabrina-safo"                 
#> [105] "mary-e-black"                  "lucy-irvine"                  
#> [107] "john-robson"                   "henrik-moller"                
#> [109] "katherine-henson"              "nuzhat-ali"                   
#> [111] "owen-brigstock-barron"         "gul-root"                     
#> [113] "natasha-roberts"               "ruth-simmons"                 
#> [115] "sean-mcphail"                  "stephen-morton"               
#> [117] "nicholas-coyle"                "wendy-nicholson"              
#> [119] "elaine-rashbrook"              "charles-alessi"               
#> [121] "mike-gent"                     "geraldine-strathdee"          
#> [123] "karen-shaw"                    "virginia-murray"              
#> [125] "shona-arora"                   "chris-hatton"                 
#> [127] "matt-kearney"                  "james-charnock"               
#> [129] "carl-petrokofsky"              "matt-hennessey"               
#> [131] "paul-crook"                    "angie-bone"                   
#> [133] "rebecca-hams"                  "jill-meara"                   
#> [135] "paul-hamblin"                  "darryl-quantz"                
#> [137] "jenifer-smith"                 "alison-hadley"                
#> [139] "bernd-eggen"                   "mary"                         
#> [141] "richard-pebody"                "steve-owens"                  
#> [143] "anthony-kessel"                "henry-kippin"                 
#> [145] "dr-sotiris-vardoulakis"        "charlotte-mcclymont"          
#> [147] "luke-hounsome"                 "charlotte-eley"               
#> [149] "vicki-young"                   "lorraine-oldridge"            
#> [151] "jolyon-medlock"                "alexander-vaux"               
#> [153] "01"                            "sani-dimitroulopoulou"        
#> [155] "lily-makurah"                  "susan-hopkins"                
#> [157] "jim-obrien"                    "peter-bradley"                
#> [159] "chris-carrigan"                "michael-heasman"              
#> [161] "david-madden"                  "helen-duncan"                 
#> [163] "julian-flowers"                "kayleigh-hansford"            
#> [165] "matthew-dryden"                "sonia-roschnik"               
#> [167] "julia-verne"                   "jenny-godson"                 
#> [169] "julia-csikar"                  "katie-carmichael"             
#> [171] "catherine-heffernan"           "joanne-bosanquet"             
#> [173] "sue-ibbotson"                  "sarah-stevens"                
#> [175] "neil-mccoll"                   "neil-bentley"                 
#> [177] "ann-tonks"                     "judith-rankin"                
#> [179] "brian-mccloskey"               "dr-jo-whaley"                 
#> [181] "emma-aarons"                   "ibrahim-abubakar"             
#> [183] "melvina-woode-owusu"           "julietta-patnick"             
#> [185] "naomi-holman"                  "jane-anderson"                
#> [187] "ann-hoskins"                   "susie-singleton"              
#> [189] "mahesh-patel"                  "sophie"                       
#> [191] "the-mythbuster"                "christine-harvey"             
#> [193] "david-rhodes"                  "mark-salter"                  
#> [195] "meng-khaw"                     "claire-currie"                
#> [197] "geraldine-oliver"              "don-sinclair"                 
#> [199] "helen-shaw"                    "hilary-kirkbride"             
#> [201] "sally-warren"                  "isabel-oliver"                
#> [203] "alex-elliot"                   "heather-lodge"                
#> [205] "caroline-hancock"              "hywell-dinsdale"              
#> [207] "andrew-cooper"                 "rashmi-shukla"                
#> [209] "jonathan-marron"               "nick-phin"                    
#> [211] "muir-gray"                     "david-heymann"

Blog text

We can extract blog posts for any given category e.g. the-week-at-phe

or Duncan’s friday message

PHE publications

Its hard to get everything PHE has published on .GVO>UK. To assist this process we have written a function which produces an interactive table of all the PHE publications by category (NB at the moment it is over inclusive). This makes use of the DT package and allows us to add download options so the data can be downloaded in various forms.